Cluster validity analysis using subsampling
1 Introduction

The word "clustering" (unsupervised classification) refers to methods of grouping objects based on some similarity measure between them. Clustering algorithms can be classified into four classes, namely partitional, hierarchical, density-based and grid-based [8]. Each of these classes has subclasses and different corresponding approaches, e.g., conceptual, fuzzy, self-organizing maps, etc. The clustering task can be divided into the following five steps (the last two are optional) [9]: 1) pattern representation; 2) definition of a pattern proximity measure; 3) clustering; 4) data abstraction; and 5) cluster validity analysis. In this paper, we consider only the last step, which in effect measures the effectiveness of the other steps.

For a given dataset, the produced clustering depends on the parameters of the applied clustering algorithm. Usually, different algorithms, and even the same algorithm with distinct parameters, generate different clustering results. Cluster validity analysis refers to assessing the confidence in the resulting clusters. For datasets with few dimensions, the clustering result can be visualized, and hence clusters can be validated by human experts. This becomes nearly impossible for large dimensions, so automatic methods are needed.

The main criteria for the evaluation of clustering results are [8]: compactness (members of each cluster should be close to each other) and separation (the clusters should be widely spaced). Based on these criteria, a number of indices have been proposed for evaluating clusters and selecting the best possible number of clusters. In most cases, assessing validity turns out to be a matter of determining the best parameters for a clustering algorithm. Confidence estimation is addressed in relatively few research papers, where confidence is given in terms of the proportion of cases clustering together.

Our motivation for the work described in this paper is estimating confidence in each cluster as a whole, i.e., not in specific cases. For this purpose, we propose three meta-methods for the cluster validity problem. To the best of our knowledge, these methods are novel, and test results demonstrate their effectiveness. The methods are all based on subsampling of the dataset. They are general and can be used for evaluating clustering results generated by a wide range of existing clustering algorithms.

The first method starts by producing a clustering using a given clustering algorithm, with the number of clusters specified. Second, it randomly samples from the labeled clusters. Third, it builds a supervised classifier on the selected subset, and the induced classifier evaluates the non-selected portion. The random subsampling and evaluation steps are repeated many times. Finally, the overall accuracy gives the stability of the clustering. These steps are in turn repeated for every candidate number of clusters, allowing comparison with clustering results produced by other clustering algorithms. Instead of random subsampling, 10-fold cross-validation can also be used.

The second method is based on subset selection of the original clusters. First of all, clusters are found by employing a given clustering algorithm. For each subset of these clusters, an algorithm that estimates the true number of clusters is run.
The argument here is that, if the given clustering is stable, then we expect the number of clusters estimated for each subset to equal the number of distinct labels in that subset. The confidence is computed as the proportion of correct estimations. The clustering result may contain a large number of clusters (say, 20 clusters); in that case trying all subsets becomes computationally intractable, so we resort to subset sampling instead. If the validity of clustering results generated by randomized algorithms like k-means is the concern, all the steps should be repeated and the results averaged, for both the first and the second methods.

The third method uses the idea that if a clustering is stable, further clustering of the cases in any one cluster will reveal a single cluster. For each of the clusters, an estimator algorithm is run and is expected to report that there is only one cluster. The whole step is repeated several times with dataset subsampling, i.e., a bootstrapping approach is employed for confidence estimation. Confidence is computed as in the second method.

The rest of the paper is organized as follows. In Section 2, some background and recent work on cluster validity are given. Section 3 presents our three methods for cluster validity analysis. Experimental results are presented in Section 4. Section 5 concludes the paper.

2 Cluster Validity and Stability

There are basically three approaches to the assessment of validity: internal, external and relative [9, 8, 7]. Internal indices measure how well the clustering result reflects the structure inherent in the dataset. Only inherent features of the dataset are used for the measurement, i.e., no external information is consulted; usually the between- and within-cluster sum-of-squares matrices are used as inherent features. A number of such indices are available, including silhouette, gap and gapPC [7]. These indices also define how to select the best number of clusters.

In external assessment of validity, there is an a priori known structure; an external index is computed from this structure and the generated structure. Such indices measure the degree of match between the two structures and are usually defined on the contingency table of the two partitions. Entry $n_{ij}$ (row $i$, column $j$) of this table is the number of patterns that belong to cluster $i$ in the a priori partition and cluster $j$ in the generated partition. Indices defined on contingency tables include Jaccard, Rand and FM. The FM measure, which is used in the Clest algorithm, is given below [7]:

$$\mathrm{FM} = \frac{Z - n}{\sqrt{\Big(\sum_{i=1}^{R} n_{i\cdot}^{2} - n\Big)\Big(\sum_{j=1}^{C} n_{\cdot j}^{2} - n\Big)}} \qquad (1)$$

where $n = \sum_{i=1}^{R}\sum_{j=1}^{C} n_{ij}$, $Z = \sum_{i=1}^{R}\sum_{j=1}^{C} n_{ij}^{2}$, $n_{i\cdot} = \sum_{j=1}^{C} n_{ij}$ and $n_{\cdot j} = \sum_{i=1}^{R} n_{ij}$, with $R$ and $C$ representing the number of clusters in the a priori and generated partitions, respectively.

Relative assessment compares two structures and measures their relative merit. The idea is to run the clustering algorithm over the range of possible parameters (e.g., each possible number of clusters) and identify the clustering scheme that best fits the dataset. Recent work on cluster validity concentrates on a kind of relative index called cluster stability [2, 3, 11, 12, 7, 13, 15]. Cluster stability exploits the fact that when multiple data samples are drawn from the same distribution, clustering algorithms are expected to behave in the same way on each of them and produce similar structures.
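As a concrete companion to Eq. (1), the following is a minimal sketch of the FM computation in Python; the function name and the NumPy-based implementation are ours, and the labels are assumed to be zero-based integers over the same n patterns.

```python
import numpy as np

def fm_index(labels_a, labels_b):
    """Fowlkes-Mallows (FM) measure between two partitions, per Eq. (1).

    labels_a: a priori cluster labels; labels_b: generated cluster labels.
    """
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    n = labels_a.size
    # Contingency table: n_ij = patterns in cluster i of the a priori
    # partition and cluster j of the generated partition.
    table = np.zeros((labels_a.max() + 1, labels_b.max() + 1))
    for i, j in zip(labels_a, labels_b):
        table[i, j] += 1
    z = (table ** 2).sum()                # Z = sum_ij n_ij^2
    row = (table.sum(axis=1) ** 2).sum()  # sum_i n_i.^2
    col = (table.sum(axis=0) ** 2).sum()  # sum_j n_.j^2
    return (z - n) / np.sqrt((row - n) * (col - n))
```

scikit-learn's `fowlkes_mallows_score` computes an equivalent pair-counting form and can serve as a cross-check of such an implementation.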
In the work described in [13], supervised predictors are built on each clustered resampling of the original dataset, and their agreement with the original cluster labeling is used as a measure of stability, or degree of match. The authors show that the choice of supervised classification algorithm does make a difference, but the measured validity remains meaningful for other choices. Taking a game-theoretic approach, they define an instability measure; the number of clusters minimizing this instability measure is used as the best cluster count for the dataset.

The work described in [13] presents an algorithm for estimating the true number of clusters. For each cluster count, the dataset is resampled twice and both resamples are clustered using the same generic clustering algorithm. The similarity between these two clusterings is measured using either the Jaccard coefficient or the matching coefficient. The resampling and similarity computations are repeated many times for each number of clusters for confidence estimation, and the averaged values are used as measures of the stability of the clustering generated by the given clustering algorithm. Histograms and cumulative distributions of the similarities are generated and plotted for selecting the best cluster count: the smallest stable cluster count is estimated as the correct number of clusters, and the decision is obvious in the cumulative distribution diagram. They also give a measure for automating this process. The algorithm has the nice property that if there is no large gap between similarities across all cluster counts, the dataset is said not to tend toward clustering, i.e., the cluster count is 1.

Another resampling-based method is given in [12]. In their setting, the original dataset is clustered first; then a number of subsamples are drawn and each of them is clustered independently using the same clustering algorithm. A figure-of-merit measure (the degree of match in the connectivity matrix) is defined between the original clustering and each of the subsampled clusterings. The figure of merit is computed for each possible parameter set, and its plot against the parameter values is used for selecting the best parameters.

A Gaussian finite-mixture based method for estimating the true number of clusters is described in [14]. The algorithm first divides the dataset into training and test subsets. Next, for each cluster count k, a model is fitted to the training set using the Expectation Maximization (EM) algorithm. The resulting parameter set is then evaluated on the test set. These steps are repeated many times and the results averaged; the averages are used for estimating the true number of clusters.

3 The Proposed Three Methods

For the methods discussed in this section, we denote the input dataset by T, consisting of n patterns, each having p dimensions; T is thus effectively an n x p matrix. The proposed algorithms can be used for different cluster counts and different clusterings, even ones generated by different clustering algorithms. We collect their confidence measures for all candidate numbers of clusters; these data can be used for relative confidence estimation of clustering algorithms on the given dataset. Any clustering algorithm operating on numeric values and taking the cluster count as a parameter (e.g., k-means, ORCLUS, PAM, CLARA) can be used in confidence estimation. For randomized algorithms like k-means, confidences should be averaged over several runs; the overall evaluation loop is sketched below.
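To fix ideas, that evaluation loop might look like the following sketch. It assumes k-means as the clustering algorithm; `confidence_of_clustering` is a hypothetical placeholder for any one of the three methods of this section, not a routine defined in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def collect_confidences(T, k_range, confidence_of_clustering, runs=10, seed=0):
    """Collect a confidence score for every candidate cluster count.

    T: (n, p) data matrix; k_range: iterable of candidate cluster counts;
    confidence_of_clustering: callable scoring one labeled clustering.
    Scores are averaged over several runs because k-means is randomized.
    """
    rng = np.random.RandomState(seed)
    scores = {}
    for k in k_range:
        per_run = []
        for _ in range(runs):
            km = KMeans(n_clusters=k, n_init=10,
                        random_state=rng.randint(2**31 - 1))
            per_run.append(confidence_of_clustering(T, km.fit_predict(T)))
        scores[k] = float(np.mean(per_run))
    return scores  # compare across k, or across clustering algorithms
```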
The ORCLUS algorithm was proposed for high-dimensional datasets. The idea behind the algorithm is to find a (potentially different) arbitrarily oriented projected subspace for each of the clusters. It is an iterative algorithm that starts with an initial partition and the original axis system. In each iteration, all patterns are first assigned to a cluster based on their projected distance to the seeds of the current clustering. Then the centroids of the clusters (the seeds) are recomputed, and a new projected subspace is computed for each of the clusters. Following this, close seeds are merged to obtain a smaller number of clusters. Iteration continues until the user-specified number of clusters is reached and the projected subspace dimensionality of each cluster reaches the user-specified minimum. Contrary to feature selection methods, which select the dimensions with the largest eigenvalues, this algorithm selects the subspaces with the smallest eigenvalues. The reason is to reduce the variability in the projected subspace, i.e., to reduce within-cluster distances. The algorithm is capable of detecting outliers and scales to very large databases; for details see [1].

3.1 The First Method: Supervised learning based approach

This method validates the result of clustering with supervised classifiers. The rationale behind this method is that if the labels generated by the clustering algorithm are valid (i.e., the clusters are well separated), then they can be used by a classifier to classify the clusters with high accuracy. This accuracy information can therefore be used for comparing different clustering algorithms with the same input parameters. Additionally, repeated measurements of accuracy on perturbed datasets can be used for estimating the validity of clustering algorithms; doing so facilitates measuring confidence in cluster validity for multiple (not just two) clustering algorithms on the same basis. The classifier is trained on a perturbed version of the labeled patterns, and its accuracy is tested on the patterns not selected for training. For confidence estimation, the subsampling is repeated many times, and the average accuracy is used as a measure of confidence in the validity of the clustering. The whole process is sketched in Algorithm 3.1.

Algorithm 3.1 (Supervised learning based method)
Input: T = dataset, K = number of clusters, B = number of subsamplings
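The body of Algorithm 3.1 is cut off in this copy. The following is a minimal sketch of the procedure as described in the text, under our own illustrative choices of k-means for clustering, a decision tree as the supervised classifier, and a 70% sampling fraction; none of these choices are prescribed by the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def supervised_confidence(T, K, B=20, sample_frac=0.7, seed=0):
    """Method 1: average held-out accuracy of a classifier trained on
    cluster labels, over B random subsamples."""
    rng = np.random.RandomState(seed)
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(T)
    n = T.shape[0]
    accuracies = []
    for _ in range(B):
        # Randomly split the labeled patterns into selected and held-out parts.
        train = rng.rand(n) < sample_frac
        clf = DecisionTreeClassifier(random_state=seed)
        clf.fit(T[train], labels[train])
        # Accuracy on the non-selected portion measures clustering stability.
        accuracies.append(clf.score(T[~train], labels[~train]))
    return float(np.mean(accuracies))
```

As noted in the introduction, 10-fold cross-validation over the labeled patterns could replace the random subsampling loop.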
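The sections detailing the second and third methods are likewise missing from this copy. For completeness, here is a hedged sketch of both, written against a generic `estimate_num_clusters(X)` routine; that name is a placeholder for any cluster-count estimator (e.g., a gap-statistic or stability-based one), not a function from the paper, and the subset-size and bootstrap details are our assumptions.

```python
import itertools
import numpy as np

def subset_confidence(T, labels, estimate_num_clusters, max_subsets=100, seed=0):
    """Method 2: over subsets of the found clusters, check that the
    estimated cluster count equals the number of labels in the subset."""
    rng = np.random.RandomState(seed)
    clusters = np.unique(labels)
    subsets = [s for r in range(2, len(clusters) + 1)
               for s in itertools.combinations(clusters, r)]
    if len(subsets) > max_subsets:  # intractable for many clusters: sample
        idx = rng.choice(len(subsets), max_subsets, replace=False)
        subsets = [subsets[i] for i in idx]
    hits = sum(estimate_num_clusters(T[np.isin(labels, s)]) == len(s)
               for s in subsets)
    return hits / len(subsets)

def one_cluster_confidence(T, labels, estimate_num_clusters, B=20, seed=0):
    """Method 3: within each cluster of a bootstrap resample, the
    estimator should report exactly one cluster."""
    rng = np.random.RandomState(seed)
    n = T.shape[0]
    hits = trials = 0
    for _ in range(B):
        idx = rng.choice(n, n, replace=True)  # bootstrap subsample
        Tb, lb = T[idx], labels[idx]
        for c in np.unique(lb):
            hits += int(estimate_num_clusters(Tb[lb == c]) == 1)
            trials += 1
    return hits / trials
```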